The Art of Equivalence

Metadata is the backbone of video delivery workflows. However, acquiring and consolidating relevant data for content libraries poses a unique and complex set of challenges. A single data source rarely provides all the necessary data, so metadata must be acquired from multiple internal and external sources, each of which assigns its own unique ID to a content record. The intricate process of merging these records into a single master record that accurately represents the data associated with each source ID is called equivalence.

Equivalence is a nuanced art that demands the ability to map content records from any source, validate broadcast IDs, ensure consistent formats, and de-duplicate repetitive records. At its core, equivalence involves identifying, matching, and linking data from different sources. However, this definition fails to capture the actual value of the tools and rules honed over MetaBroadcast’s 10+ years of managing metadata. Our extensive experience mapping content records from numerous sources brings a wealth of unique insight and expertise to the equivalence process, instilling a strong sense of confidence in our customers.

Equivalence starts with a subject, or content record. At a minimum, this record carries the data provider’s unique content ID, program title, and release date. ID matching tends to be very precise: unique identifiers are specifically assigned to each record, which reduces the likelihood of false positives. It depends entirely, however, on identifiers being available. MetaBroadcast’s repository of over 140 million MetaBroadcast IDs (MBIDs) represents content records that have been consolidated and matched from various sources, and each MBID ties a consolidated content record to the unique IDs assigned by each source. When a customer ingests data from commercial metadata providers, there is a high probability that MetaBroadcast already holds a content record with a matching ID, which simplifies the equivalence process.
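To make the ID-matching step concrete, here is a minimal sketch of an index that maps each source's own ID to a master ID. It is an illustration only, not MetaBroadcast's implementation: the `IdIndex` class, its method names, and the sample source names and ID values are all hypothetical.

```python
from collections import defaultdict


class IdIndex:
    """Hypothetical index from (source, source_id) pairs to a master ID."""

    def __init__(self):
        self._master_by_source_id = {}
        self._source_ids_by_master = defaultdict(set)

    def link(self, master_id, source, source_id):
        """Record that a source's ID belongs to the given master record."""
        self._master_by_source_id[(source, source_id)] = master_id
        self._source_ids_by_master[master_id].add((source, source_id))

    def lookup(self, source, source_id):
        """Return the master ID for a source ID, or None if it is unknown."""
        return self._master_by_source_id.get((source, source_id))

    def source_ids(self, master_id):
        """Return every (source, source_id) pair linked to a master ID."""
        return set(self._source_ids_by_master.get(master_id, set()))


# Illustrative values only: two providers describing the same programme.
index = IdIndex()
index.link("mbid-001", "provider_a", "A-123")
index.link("mbid-001", "provider_b", "B-987")
```

A lookup that returns a master ID means the incoming record can be attached directly. A lookup that returns `None` means no identifier is available, and matching has to fall back on other data fields.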

However, mapping content records is more complex than assessing whether two records share the same unique ID. Unlike other industries, the media and broadcasting industry lacks universally adopted identification standards. Each studio, broadcaster, or content provider uses its own set of identifiers, leading to fragmentation and interoperability issues. Consequently, equivalence often requires leveraging several data fields while applying rules and scoring logic.

While title matching is often considered an acceptable foundation for equivalence, titles can vary significantly in format, language, spelling, and punctuation. For example, a movie title might vary due to regional differences, alternate spellings, abbreviations, or special characters. Alternatively, two different movies can simply share the same title (e.g., Bad Boys, released in 1983 and starring Sean Penn vs. Bad Boys, released in 1995 and starring Will Smith). The need to manage this variability while still matching accurately has shaped MetaBroadcast’s equivalence process, which incorporates data fields for title, release date, cast, director, character names, description, series & episode number, duration, and more.
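The sketch below illustrates why a title alone is not enough. It normalises punctuation, case, and accents so that formatting variants compare equal, then uses the release year to separate distinct works that share a title. The function names, the record layout, and the one-year tolerance are assumptions made for this example, not a description of MetaBroadcast's rules.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to lowercase ASCII letters and digits separated by spaces.

    Assumption: Latin-script titles only; other scripts need different handling.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = without_accents.casefold().replace("&", " and ")
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    return " ".join(cleaned.split())


def same_work(a, b, year_tolerance=1):
    """Treat two records as the same work only if title AND release year agree."""
    titles_match = normalize_title(a["title"]) == normalize_title(b["title"])
    years_match = abs(a["year"] - b["year"]) <= year_tolerance
    return titles_match and years_match


bad_boys_1983 = {"title": "Bad Boys", "year": 1983}
bad_boys_1995 = {"title": "BAD BOYS!", "year": 1995}
print(same_work(bad_boys_1983, bad_boys_1995))  # False: same title, different films
```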

MetaBroadcast’s automated equivalence process uses fuzzy matching and hard or soft rules to identify candidates and score relationships between subject and candidate content records that may be considered for matching. If the candidate scores high enough, the two become linked, and an MBID is associated with the merged record. 
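As a rough illustration of candidate scoring, the sketch below combines one hard rule (reject candidates whose release years are far apart) with two soft, weighted signals (fuzzy title similarity and cast overlap), and links a candidate only when its score clears a threshold. The weights, the threshold, the field names, and the use of Python's `difflib` for fuzzy matching are all assumptions chosen for the example; the actual rules and scoring logic are not described in this article.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"title": 0.6, "cast": 0.4}
THRESHOLD = 0.8


def title_similarity(a, b):
    """Fuzzy similarity between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def cast_overlap(a, b):
    """Jaccard overlap between two cast lists, from 0.0 to 1.0."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def score(subject, candidate):
    """Score a candidate against the subject record."""
    # Hard rule: release years more than one apart can never match.
    if abs(subject["year"] - candidate["year"]) > 1:
        return 0.0
    # Soft rules: each signal contributes in proportion to its weight.
    return (
        WEIGHTS["title"] * title_similarity(subject["title"], candidate["title"])
        + WEIGHTS["cast"] * cast_overlap(subject["cast"], candidate["cast"])
    )


def best_match(subject, candidates):
    """Return the highest-scoring candidate if it clears the threshold, else None."""
    scored = [(score(subject, c), c) for c in candidates]
    top_score, top = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return top if top_score >= THRESHOLD else None
```

With these assumed settings, a 1995 "Bad Boys" subject would link to a 1995 candidate with the same cast, while the 1983 film of the same name would be rejected by the hard rule before any soft scoring takes place.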

It is important to note that MetaBroadcast is data agnostic. Customers using Atlas can define their sources and establish precedence for their preferred sources. This means that upon completion of the equivalence process, data from those prioritised sources will be used to populate data fields in the merged record. A typical video service provider will merge content records from both their internal platforms and defined external sources. In contrast, audience measurement firms merge records from multiple broadcasters, streaming services, and census providers. The definition and selection of metadata sources are dictated by the data fields (e.g., images, schedule data, ratings, reviews, etc.) most important to each customer. Our active metadata platform Atlas gives customers visibility of the merged data record, identifying data provenance and reflecting the customer’s defined data schema and precedence of sources. 
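Source precedence can be pictured as a field-by-field merge in which the first source in the customer's preferred order that has a value for a field wins, and the winning source is recorded as that field's provenance. The following is a minimal sketch under that assumption; `merge_records`, the source names, and the field names are hypothetical and do not represent the Atlas API.

```python
def merge_records(records_by_source, precedence):
    """Merge per-source records field by field, honouring source precedence.

    Returns the merged record and a map of field name -> source it came from.
    """
    merged = {}
    provenance = {}
    for source in precedence:  # highest-priority source first
        record = records_by_source.get(source, {})
        for field, value in record.items():
            is_empty = value is None or value == "" or value == []
            if field not in merged and not is_empty:
                merged[field] = value
                provenance[field] = source
    return merged, provenance


# Illustrative data: the internal platform is preferred but has no image.
records = {
    "internal": {"title": "Bad Boys", "image": None},
    "provider_a": {"title": "BAD BOYS", "image": "poster.jpg", "rating": "R"},
}
merged, provenance = merge_records(records, precedence=["internal", "provider_a"])
```

Here the title comes from the preferred internal source, while the image and rating fall through to the next source in line, and `provenance` records which source supplied each field.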

Equivalence is not a one-time task but an ongoing, iterative process. Once data is ingested and cleansed, exceptions or errors may be identified, usually because of poor data quality or because filtering is needed to limit the ingest of certain types of data. Our experience allows us to quickly apply rules from an established library to refine the equivalence process, so that data quality continues to improve over time.

As a result of aggregating metadata and performing equivalence for our customers, we have built a repository of master MBIDs and their associated content records. This repository reflects data that has been ingested, cleansed, and put through equivalence from a wide range of sources selected by our customers. The records are continuously updated as existing data changes or new data becomes available. The insight we can derive from this repository of over 140 million records brings considerable value to our customers and their data management processes.